Can you get rich by doing machine learning?¶

*Let the data help you become a talent in the field of machine learning*¶

1. Introduction¶

Machine learning is a highly interdisciplinary subject and a popular direction for our future employment. It is developing rapidly and has excellent prospects, which makes it extremely attractive to us.

But you have probably noticed a problem: what is the relationship between the courses we study every day and the work we will do in the future? We learn programming languages, databases, all sorts of model concepts, and even big data, trying to shape ourselves into well-rounded talents, but which of it will actually turn out to be useful?

Becoming a talent also means facing practical questions. I suspect everyone is curious how much you can earn once you make it in this field. So we decided to use this "insider" dataset to dig into the questions everyone wonders about.

2. Data loading and data information display¶

2-1 Importing third-party libraries¶

In [1]:
 pip install -i https://pypi.tuna.tsinghua.edu.cn/simple pip -U plotly
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Requirement already satisfied: pip in d:\anaconda\lib\site-packages (22.3.1)
Requirement already satisfied: plotly in d:\anaconda\lib\site-packages (5.11.0)
Requirement already satisfied: tenacity>=6.2.0 in d:\anaconda\lib\site-packages (from plotly) (8.1.0)
Note: you may need to restart the kernel to use updated packages.
In [13]:
pip install missingno
Collecting missingno
  Downloading missingno-0.5.2-py3-none-any.whl.metadata (639 bytes)
Requirement already satisfied: numpy in f:\anaconda\lib\site-packages (from missingno) (1.26.4)
Requirement already satisfied: matplotlib in f:\anaconda\lib\site-packages (from missingno) (3.8.4)
Requirement already satisfied: scipy in f:\anaconda\lib\site-packages (from missingno) (1.13.1)
Requirement already satisfied: seaborn in f:\anaconda\lib\site-packages (from missingno) (0.13.2)
Requirement already satisfied: contourpy>=1.0.1 in f:\anaconda\lib\site-packages (from matplotlib->missingno) (1.2.0)
Requirement already satisfied: cycler>=0.10 in f:\anaconda\lib\site-packages (from matplotlib->missingno) (0.11.0)
Requirement already satisfied: fonttools>=4.22.0 in f:\anaconda\lib\site-packages (from matplotlib->missingno) (4.51.0)
Requirement already satisfied: kiwisolver>=1.3.1 in f:\anaconda\lib\site-packages (from matplotlib->missingno) (1.4.4)
Requirement already satisfied: packaging>=20.0 in f:\anaconda\lib\site-packages (from matplotlib->missingno) (23.2)
Requirement already satisfied: pillow>=8 in f:\anaconda\lib\site-packages (from matplotlib->missingno) (10.3.0)
Requirement already satisfied: pyparsing>=2.3.1 in f:\anaconda\lib\site-packages (from matplotlib->missingno) (3.0.9)
Requirement already satisfied: python-dateutil>=2.7 in f:\anaconda\lib\site-packages (from matplotlib->missingno) (2.9.0.post0)
Requirement already satisfied: pandas>=1.2 in f:\anaconda\lib\site-packages (from seaborn->missingno) (2.2.2)
Requirement already satisfied: pytz>=2020.1 in f:\anaconda\lib\site-packages (from pandas>=1.2->seaborn->missingno) (2024.1)
Requirement already satisfied: tzdata>=2022.7 in f:\anaconda\lib\site-packages (from pandas>=1.2->seaborn->missingno) (2023.3)
Requirement already satisfied: six>=1.5 in f:\anaconda\lib\site-packages (from python-dateutil>=2.7->matplotlib->missingno) (1.16.0)
Downloading missingno-0.5.2-py3-none-any.whl (8.7 kB)
Installing collected packages: missingno
Successfully installed missingno-0.5.2
Note: you may need to restart the kernel to use updated packages.
In [9]:
import warnings
warnings.filterwarnings('ignore') 
In [14]:
import pandas as pd
import numpy as np
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_selection import SelectKBest, SelectPercentile
from sklearn.feature_selection import mutual_info_regression
from sklearn.preprocessing import LabelEncoder
import missingno as msno
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.tree import DecisionTreeRegressor , ExtraTreeRegressor

2-2 Reading Data¶

First, we need a general picture of the dataset: how many people in the industry filled out the questionnaire, and what questions they were asked.

In [17]:
data = pd.read_csv('kaggle_survey_2022_responses.csv')

2-3 View basic data information¶

In [20]:
print("data shape =",data.shape)
data shape = (23998, 296)
In [22]:
print("Show the first five rows:")
data.head()
Show the first five rows:
Out[22]:
Duration (in seconds) Q2 Q3 Q4 Q5 Q6_1 Q6_2 Q6_3 Q6_4 Q6_5 ... Q44_3 Q44_4 Q44_5 Q44_6 Q44_7 Q44_8 Q44_9 Q44_10 Q44_11 Q44_12
0 Duration (in seconds) What is your age (# years)? What is your gender? - Selected Choice In which country do you currently reside? Are you currently a student? (high school, uni... On which platforms have you begun or completed... On which platforms have you begun or completed... On which platforms have you begun or completed... On which platforms have you begun or completed... On which platforms have you begun or completed... ... Who/what are your favorite media sources that ... Who/what are your favorite media sources that ... Who/what are your favorite media sources that ... Who/what are your favorite media sources that ... Who/what are your favorite media sources that ... Who/what are your favorite media sources that ... Who/what are your favorite media sources that ... Who/what are your favorite media sources that ... Who/what are your favorite media sources that ... Who/what are your favorite media sources that ...
1 121 30-34 Man India No NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 462 30-34 Man Algeria No NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 293 18-21 Man Egypt Yes Coursera edX NaN DataCamp NaN ... NaN Kaggle (notebooks, forums, etc) NaN YouTube (Kaggle YouTube, Cloud AI Adventures, ... Podcasts (Chai Time Data Science, O’Reilly Dat... NaN NaN NaN NaN NaN
4 851 55-59 Man France No Coursera NaN Kaggle Learn Courses NaN NaN ... NaN Kaggle (notebooks, forums, etc) Course Forums (forums.fast.ai, Coursera forums... NaN NaN Blogs (Towards Data Science, Analytics Vidhya,... NaN NaN NaN NaN

5 rows × 296 columns

In [27]:
print("Last five rows:")
data.tail()
Last five rows:
Out[27]:
Duration (in seconds) Q2 Q3 Q4 Q5 Q6_1 Q6_2 Q6_3 Q6_4 Q6_5 ... Q44_3 Q44_4 Q44_5 Q44_6 Q44_7 Q44_8 Q44_9 Q44_10 Q44_11 Q44_12
23993 331 22-24 Man United States of America Yes NaN NaN NaN NaN NaN ... NaN Kaggle (notebooks, forums, etc) NaN YouTube (Kaggle YouTube, Cloud AI Adventures, ... Podcasts (Chai Time Data Science, O’Reilly Dat... NaN Journal Publications (peer-reviewed journals, ... NaN NaN NaN
23994 330 60-69 Man United States of America Yes NaN NaN NaN NaN NaN ... NaN NaN NaN YouTube (Kaggle YouTube, Cloud AI Adventures, ... NaN NaN NaN NaN NaN NaN
23995 860 25-29 Man Turkey No NaN NaN NaN DataCamp NaN ... NaN Kaggle (notebooks, forums, etc) NaN YouTube (Kaggle YouTube, Cloud AI Adventures, ... NaN NaN NaN NaN NaN NaN
23996 597 35-39 Woman Israel No NaN NaN Kaggle Learn Courses NaN NaN ... NaN NaN NaN YouTube (Kaggle YouTube, Cloud AI Adventures, ... NaN NaN NaN NaN NaN NaN
23997 303 18-21 Man India Yes NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN Other

5 rows × 296 columns

In [29]:
data.dtypes
Out[29]:
Duration (in seconds)    object
Q2                       object
Q3                       object
Q4                       object
Q5                       object
                          ...  
Q44_8                    object
Q44_9                    object
Q44_10                   object
Q44_11                   object
Q44_12                   object
Length: 296, dtype: object
In [31]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23998 entries, 0 to 23997
Columns: 296 entries, Duration (in seconds) to Q44_12
dtypes: object(296)
memory usage: 54.2+ MB
In [33]:
# Summary statistics of the dataset:
# count - non-null values per column; unique - number of distinct values;
# top - the most frequent option; freq - its frequency
data.describe()
Out[33]:
Duration (in seconds) Q2 Q3 Q4 Q5 Q6_1 Q6_2 Q6_3 Q6_4 Q6_5 ... Q44_3 Q44_4 Q44_5 Q44_6 Q44_7 Q44_8 Q44_9 Q44_10 Q44_11 Q44_12
count 23998 23998 23998 23998 23998 9700 2475 6629 3719 945 ... 2679 11182 4007 11958 2121 7767 3805 1727 1 836
unique 4329 12 6 59 3 2 2 2 2 2 ... 2 2 2 2 2 2 2 2 1 2
top 230 18-21 Man India No Coursera edX Kaggle Learn Courses DataCamp Fast.ai ... Reddit (r/machinelearning, etc) Kaggle (notebooks, forums, etc) Course Forums (forums.fast.ai, Coursera forums... YouTube (Kaggle YouTube, Cloud AI Adventures, ... Podcasts (Chai Time Data Science, O’Reilly Dat... Blogs (Towards Data Science, Analytics Vidhya,... Journal Publications (peer-reviewed journals, ... Slack Communities (ods.ai, kagglenoobs, etc) Who/what are your favorite media sources that ... Other
freq 59 4559 18266 8792 12036 9699 2474 6628 3718 944 ... 2678 11181 4006 11957 2120 7766 3804 1726 1 835

4 rows × 296 columns

3. Data analysis¶

Now we come to the data analysis module. The raw table is dizzying; to find the content we need, we have to visualize the data so the information becomes intuitive.

In [37]:
data_1 = data.drop([0])  # row 0 holds the full question text, not a response
data_1
Out[37]:
Duration (in seconds) Q2 Q3 Q4 Q5 Q6_1 Q6_2 Q6_3 Q6_4 Q6_5 ... Q44_3 Q44_4 Q44_5 Q44_6 Q44_7 Q44_8 Q44_9 Q44_10 Q44_11 Q44_12
1 121 30-34 Man India No NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 462 30-34 Man Algeria No NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 293 18-21 Man Egypt Yes Coursera edX NaN DataCamp NaN ... NaN Kaggle (notebooks, forums, etc) NaN YouTube (Kaggle YouTube, Cloud AI Adventures, ... Podcasts (Chai Time Data Science, O’Reilly Dat... NaN NaN NaN NaN NaN
4 851 55-59 Man France No Coursera NaN Kaggle Learn Courses NaN NaN ... NaN Kaggle (notebooks, forums, etc) Course Forums (forums.fast.ai, Coursera forums... NaN NaN Blogs (Towards Data Science, Analytics Vidhya,... NaN NaN NaN NaN
5 232 45-49 Man India Yes NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN Blogs (Towards Data Science, Analytics Vidhya,... NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
23993 331 22-24 Man United States of America Yes NaN NaN NaN NaN NaN ... NaN Kaggle (notebooks, forums, etc) NaN YouTube (Kaggle YouTube, Cloud AI Adventures, ... Podcasts (Chai Time Data Science, O’Reilly Dat... NaN Journal Publications (peer-reviewed journals, ... NaN NaN NaN
23994 330 60-69 Man United States of America Yes NaN NaN NaN NaN NaN ... NaN NaN NaN YouTube (Kaggle YouTube, Cloud AI Adventures, ... NaN NaN NaN NaN NaN NaN
23995 860 25-29 Man Turkey No NaN NaN NaN DataCamp NaN ... NaN Kaggle (notebooks, forums, etc) NaN YouTube (Kaggle YouTube, Cloud AI Adventures, ... NaN NaN NaN NaN NaN NaN
23996 597 35-39 Woman Israel No NaN NaN Kaggle Learn Courses NaN NaN ... NaN NaN NaN YouTube (Kaggle YouTube, Cloud AI Adventures, ... NaN NaN NaN NaN NaN NaN
23997 303 18-21 Man India Yes NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN Other

23997 rows × 296 columns

3-1 Distribution map of personal information¶

3-1-1 Age (Q2) distribution¶

In the computer field, youth is an asset: the younger you are, the more room you have to create miracles.

In [42]:
Num_Q2 = data_1["Q2"].value_counts()
print(Num_Q2)

# Presented as a donut (pie) chart
# hole: radius of the hollow center, in [0, 1]
# template: canvas style; options include ggplot2, seaborn, simple_white, plotly, plotly_white
fig=px.pie(values=Num_Q2.values, names=Num_Q2.index, hole=0.7, template='plotly_white') 

fig.update_traces(textposition='outside', textinfo='percent+label')

fig.add_annotation(dict(x=0.5, y=0.5,  align='center',
                        xref = "paper", yref = "paper",
                         showarrow = False, font_size=22,
                        text="<b>Age</b>"))

fig.show()
Q2
18-21    4559
25-29    4472
22-24    4283
30-34    2972
35-39    2353
40-44    1927
45-49    1253
50-54     914
55-59     611
60-69     526
70+       127
Name: count, dtype: int64

3-1-2 Gender (Q3) distribution¶

Gender: in the computer field men do make up the majority, but women create just as much immeasurable value.

In [46]:
Num_Q3 = data_1["Q3"].value_counts()
print(Num_Q3)

fig=px.pie(values=Num_Q3.values, names=Num_Q3.index, hole=0.7, template='ggplot2') 

fig.update_traces(textposition='outside', textinfo='percent+label')

fig.add_annotation(dict(x=0.5, y=0.5,  align='center',
                        xref = "paper", yref = "paper",
                         showarrow = False, font_size=22,
                        text="<b>Gender</b>"))

fig.show()
Q3
Man                        18266
Woman                       5286
Prefer not to say            334
Nonbinary                     78
Prefer to self-describe       33
Name: count, dtype: int64

3-1-3 Regional (Q4) distribution¶

Kaggle is a very successful platform; people all over the world use it for learning and communication. Knowledge only sparks brighter when ideas collide.

In [50]:
Num_Q4 = data_1["Q4"].value_counts()
print(Num_Q4)  # a pandas Series

print("---------------------------------------------------------------------")

Num_Q42=pd.DataFrame({'Country':Num_Q4.index,'Count':Num_Q4.values})
print(Num_Q42)

# There are too many countries, so we present them as a bar chart instead
# x, y: axes; color: adds a legend; template: canvas style; text: show values; title: chart title
fig=px.bar(Num_Q42, x='Country', y='Count', color='Country', template='seaborn', text='Count', title='<b>Country</b>')

fig.show()
Q4
India                                                   8792
United States of America                                2920
Other                                                   1430
Brazil                                                   833
Nigeria                                                  731
Pakistan                                                 620
Japan                                                    556
China                                                    453
Egypt                                                    383
Mexico                                                   380
Indonesia                                                376
Turkey                                                   345
Russia                                                   324
South Korea                                              317
France                                                   262
United Kingdom of Great Britain and Northern Ireland     258
Spain                                                    257
Canada                                                   257
Colombia                                                 256
Bangladesh                                               251
Taiwan                                                   242
Viet Nam                                                 212
Argentina                                                204
Kenya                                                    201
Italy                                                    182
Morocco                                                  177
Australia                                                142
Thailand                                                 132
Tunisia                                                  125
Peru                                                     121
Iran, Islamic Republic of...                             120
Chile                                                    115
Poland                                                   113
South Africa                                             109
Philippines                                              108
Netherlands                                              108
Ghana                                                    107
Israel                                                   102
Germany                                                   99
Ethiopia                                                  98
United Arab Emirates                                      94
Portugal                                                  87
Saudi Arabia                                              84
Ukraine                                                   79
Sri Lanka                                                 77
Nepal                                                     75
Malaysia                                                  74
Singapore                                                 68
Cameroon                                                  68
Algeria                                                   62
Hong Kong (S.A.R.)                                        58
Zimbabwe                                                  54
Ecuador                                                   54
Ireland                                                   53
Belgium                                                   51
Romania                                                   50
Czech Republic                                            49
I do not wish to disclose my location                     42
Name: count, dtype: int64
---------------------------------------------------------------------
                                              Country  Count
0                                               India   8792
1                            United States of America   2920
2                                               Other   1430
3                                              Brazil    833
4                                             Nigeria    731
5                                            Pakistan    620
6                                               Japan    556
7                                               China    453
8                                               Egypt    383
9                                              Mexico    380
10                                          Indonesia    376
11                                             Turkey    345
12                                             Russia    324
13                                        South Korea    317
14                                             France    262
15  United Kingdom of Great Britain and Northern I...    258
16                                              Spain    257
17                                             Canada    257
18                                           Colombia    256
19                                         Bangladesh    251
20                                             Taiwan    242
21                                           Viet Nam    212
22                                          Argentina    204
23                                              Kenya    201
24                                              Italy    182
25                                            Morocco    177
26                                          Australia    142
27                                           Thailand    132
28                                            Tunisia    125
29                                               Peru    121
30                       Iran, Islamic Republic of...    120
31                                              Chile    115
32                                             Poland    113
33                                       South Africa    109
34                                        Philippines    108
35                                        Netherlands    108
36                                              Ghana    107
37                                             Israel    102
38                                            Germany     99
39                                           Ethiopia     98
40                               United Arab Emirates     94
41                                           Portugal     87
42                                       Saudi Arabia     84
43                                            Ukraine     79
44                                          Sri Lanka     77
45                                              Nepal     75
46                                           Malaysia     74
47                                          Singapore     68
48                                           Cameroon     68
49                                            Algeria     62
50                                 Hong Kong (S.A.R.)     58
51                                           Zimbabwe     54
52                                            Ecuador     54
53                                            Ireland     53
54                                            Belgium     51
55                                            Romania     50
56                                     Czech Republic     49
57              I do not wish to disclose my location     42

3-1-4 Job Positions (Q23) Distribution¶

I believe this is the part everyone is most interested in: for those of us in the big data field, what are the future career options?

In [54]:
Num_Q23 = data_1["Q23"].value_counts()
print(Num_Q23)

Num_Q23_2=pd.DataFrame({'Job':Num_Q23.index,'Count':Num_Q23.values})
# print(Num_Q23_2)

fig=px.bar(Num_Q23_2, x='Job', y='Count', color='Job', template='seaborn', text='Count', title='<b>Job</b>')

fig.show()
Q23
Data Scientist                                                      1929
Data Analyst (Business, Marketing, Financial, Quantitative, etc)    1538
Currently not employed                                              1432
Software Engineer                                                    980
Teacher / professor                                                  833
Manager (Program, Project, Operations, Executive-level, etc)         832
Other                                                                754
Research Scientist                                                   593
Machine Learning/ MLops Engineer                                     571
Engineer (non-software)                                              465
Data Engineer                                                        352
Statistician                                                         125
Data Architect                                                        95
Data Administrator                                                    70
Developer Advocate                                                    61
Name: count, dtype: int64

3-1-5 Distribution of work areas (Q24)¶

As mentioned in the introduction, machine learning is a typically interdisciplinary subject. So which specific industries does it actually reach?

In [58]:
Num_Q24 = data_1["Q24"].value_counts()
print(Num_Q24)


fig=px.pie(values=Num_Q24.values, names=Num_Q24.index, hole=0.4, template='plotly') 

fig.update_traces(textposition='inside', textinfo='percent+label')

fig.add_annotation(dict(x=0.5, y=0.5,  align='center',
                        xref = "paper", yref = "paper",
                         showarrow = False, font_size=22,
                        text="<b>Job Area</b>"))

fig.show()
Q24
Computers/Technology                      2321
Academics/Education                       1447
Accounting/Finance                         802
Other                                      750
Manufacturing/Fabrication                  561
Medical/Pharmaceutical                     509
Government/Public Service                  500
Online Service/Internet-based Services     461
Retail/Sales                               398
Energy/Mining                              320
Insurance/Risk Assessment                  256
Marketing/CRM                              246
Non-profit/Service                         194
Broadcasting/Communications                179
Shipping/Transportation                    150
Name: count, dtype: int64

3-1-6 Salary (Q29) distribution¶

Now we come to what we care about most: how much wealth can this career create for us? Looking at this distribution, it is hard not to picture a position for ourselves in it. What does it take to become a highly paid talent? That is the question we will explore later.

In [62]:
Num_Q29 = data_1["Q29"].value_counts()
print(Num_Q29)

# print("---------------------------------------------------------------------")

Num_Q29_2=pd.DataFrame({'Salary':Num_Q29.index,'Count':Num_Q29.values})

fig=px.bar(Num_Q29_2, x='Salary', y='Count', color='Salary', template='seaborn', text='Count', title='<b>Salary</b>')

fig.show()
Q29
$0-999              1112
10,000-14,999        493
30,000-39,999        464
1,000-1,999          444
40,000-49,999        421
100,000-124,999      404
5,000-7,499          391
50,000-59,999        366
7,500-9,999          362
150,000-199,999      342
20,000-24,999        337
60,000-69,999        318
15,000-19,999        299
70,000-79,999        289
25,000-29,999        277
2,000-2,999          271
125,000-149,999      269
3,000-3,999          244
4,000-4,999          234
80,000-89,999        222
90,000-99,999        197
200,000-249,999      155
250,000-299,999       78
300,000-499,999       76
$500,000-999,999      48
>$1,000,000           23
Name: count, dtype: int64

3-1-7 Educational background (Q8) distribution¶

The computer field produces many high-end talents, so what level of education do respondents hold?

In [66]:
Num_Q8 = data_1["Q8"].value_counts()
print(Num_Q8)

fig=px.pie(values=Num_Q8.values, names=Num_Q8.index, hole=0.4, template='plotly_white') 

fig.update_traces(textposition='outside', textinfo='percent+label')

fig.update_layout(
    title={   
        "text":"<b>Academic Qualification</b>",
        "y":0.96,  # vertical title position
        "x":0.4,   # horizontal title position
    }
)

fig.show()
Q8
Master’s degree                                                      9142
Bachelor’s degree                                                    7625
Doctoral degree                                                      2657
Some college/university study without earning a bachelor’s degree    1431
I prefer not to answer                                               1394
Professional doctorate                                                585
No formal education past high school                                  564
Name: count, dtype: int64

3-2 ML&DS related data distribution diagram¶

We have just analyzed the basic situation of industry insiders; now we turn to a more detailed look at the data that is directly relevant to the profession.

When programming every day, you may wonder: will the software I use, the languages I write, the platforms I run on, the libraries I call, and the algorithms I apply still matter in my future work?

To put it more simply: when we first entered school, we all wondered why we had to learn advanced mathematics. You don't need it to do the arithmetic when buying vegetables; surely we don't need calculus to price bean sprouts?

To resolve such doubts, let's look at how respondents with different amounts of programming experience answered these questions.

3-2-1 Distribution of time spent using machine learning methods (Q16)¶

In [71]:
Num_Q16 = data_1["Q16"].value_counts()
print(Num_Q16)

fig=px.pie(values=Num_Q16.values, names=Num_Q16.index, hole=0.4, template='ggplot2') 

fig.update_traces(textposition='inside', textinfo='percent+label')

fig.update_layout(
    title={   
        "text":"Time distribution using machine learning methods",
        "y":0.96,  # vertical title position
        "x":0.4,   # horizontal title position
    }
)

fig.show()
Q16
Under 1 year                             7221
1-2 years                                3720
I do not use machine learning methods    3419
2-3 years                                1947
5-10 years                               1090
3-4 years                                1053
4-5 years                                 950
10-20 years                               483
20 or more years                            3
Name: count, dtype: int64

3-2-2 Programming time (Q11) distribution¶

In [74]:
Num_Q11 = data_1["Q11"].value_counts()
print(Num_Q11)


fig=px.pie(values=Num_Q11.values, names=Num_Q11.index, hole=0.4, template='ggplot2') 

fig.update_traces(textposition='outside', textinfo='percent+label')

fig.update_layout(
    title={   
        "text":"Programming time distribution",
        "y":0.96,  # vertical title position
        "x":0.45,  # horizontal title position
    }
)

fig.show()
Q11
1-3 years                    6459
< 1 years                    5454
3-5 years                    3399
5-10 years                   2556
I have never written code    2037
10-20 years                  1801
20+ years                    1537
Name: count, dtype: int64

3-2-3 Usage of learning platform (Q6)¶

Let's first see which of these learning platforms are the most popular, so that in the future we could build a new learning platform modeled on CSDN and learn more applicable knowledge.

In [78]:
data_1["Q6_1"].value_counts()  #9699
Out[78]:
Q6_1
Coursera    9699
Name: count, dtype: int64
In [16]:
# Option labels and counts, collected from value_counts() of each Q6_* column
platform=['Coursera','edX','Kaggle Learn Courses','DataCamp','Fast.ai','Udacity','Udemy','LinkedIn Learning','Cloud-certification programs','University Courses','None','Other']
count=[9699,2474,6628,3718,944,2199,6116,2766,1821,6780,2643,5669]
df2=pd.DataFrame({'Platform':platform,'Count':count})
df2=df2.sort_values(by='Count',ascending=False).reset_index(drop=True)
fig=px.bar(df2,x='Platform',y='Count',color='Platform',text='Count',template='simple_white',title='<b>Platforms used by Kagglers for completing Data Science Courses</b>')
fig.update_layout(title_x=0.5)
fig.update_traces(marker=dict(line=dict(color='#000000', width=1.6)))
fig.show()
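The platform counts above are hard-coded from per-column `value_counts()` calls. Since each `Q6_*` column holds the option's text for respondents who selected it and NaN otherwise, the same numbers can be derived programmatically. A sketch (the helper name `multiselect_counts` is ours; it assumes the `data_1` frame defined earlier):

```python
import pandas as pd

def multiselect_counts(df, prefix):
    """Count respondents per option of a multi-select survey question whose
    columns share `prefix` (e.g. 'Q6_'). Each such column contains the
    option's text for respondents who selected it, NaN otherwise."""
    counts = {}
    for col in df.columns:
        if not col.startswith(prefix):
            continue
        non_null = df[col].dropna()
        if non_null.empty:
            continue
        counts[non_null.iloc[0]] = len(non_null)  # label by the option text
    return pd.Series(counts).sort_values(ascending=False)

# e.g. multiselect_counts(data_1, 'Q6_') should reproduce the counts above
# (Coursera 9699, edX 2474, ...).
```

This avoids transcribing numbers by hand and stays correct if the data changes.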

The figure shows the four most popular learning platforms. Now let's see how these platforms are distributed between "newbies" and "veterans".

In [17]:
# Count respondents per programming-experience bracket (Q11) for each of the
# four most popular platforms. 'Duration (in seconds)' is always filled in,
# so counting it gives the group sizes.
def platform_by_year(col):
    g = data_1.groupby([col, 'Q11'], as_index=False)['Duration (in seconds)'].count()
    return g.rename(columns={col: 'Platform', 'Q11': 'Year', 'Duration (in seconds)': 'Count'})

a1 = platform_by_year('Q6_1')   # Coursera
a2 = platform_by_year('Q6_10')  # University Courses
a3 = platform_by_year('Q6_3')   # Kaggle Learn Courses
a4 = platform_by_year('Q6_7')   # Udemy

fig=make_subplots(rows=2,cols=2,subplot_titles=('<em>Coursera</em>','<em>University Courses</em>','<em>Kaggle Learn Courses</em>','<em>Udemy</em>'))
fig.add_trace(go.Bar(x=a1['Year'],y=a1[a1['Platform']=='Coursera']['Count'],name='Coursera',text=a1['Count']),row=1,col=1)
fig.add_trace(go.Bar(x=a2['Year'],y=a2[a2['Platform']=='University Courses (resulting in a university degree)']['Count'],name='University Courses',text=a2['Count']),row=1,col=2)
fig.add_trace(go.Bar(x=a3['Year'],y=a3[a3['Platform']=='Kaggle Learn Courses']['Count'],name='Kaggle Learn Courses',text=a3['Count']),row=2,col=1)
fig.add_trace(go.Bar(x=a4['Year'],y=a4[a4['Platform']=='Udemy']['Count'],name='Udemy',text=a4['Count']),row=2,col=2)
fig.show()

3-2-4 Programming language (Q12) usage¶

The major programming languages run through almost our entire first two years of study. Perhaps you have also wondered: are these languages really that commonly used?

In [18]:
# Option labels and counts, collected from value_counts() of each Q12_* column
pl=['Python','R','SQL','C','C#','C++','Java','Javascript','Bash','PHP','MATLAB','Julia','Go','None','Other']
count=[18653,4571,9620,3801,1473,4549,3862,3489,1674,1443,2441,296,322,256,1342]
df7=pd.DataFrame({'Programming Language':pl,'Count':count})
df7.sort_values(by='Count',ascending=False,inplace=True)
df7.reset_index(drop=True)
fig=px.bar(df7,x='Programming Language',y='Count',color='Programming Language',template='simple_white',text='Count',title='<b>What programming languages do Kagglers use on a regular basis?</b>')
fig.update_layout(title_x=0.5)
fig.update_traces(marker=dict(line=dict(color='#000000', width=1.6)))
fig.show()

The visualization of question Q12 shows that, however much we suffer while learning all these languages, the most used language at present is Python. If you want to become a talent, you still have to write Python.

In [15]:
a1=data_1.groupby(['Q12_1','Q11'],as_index=False)['Duration (in seconds)'].count()
a1['Language']=a1['Q12_1']
a1['Year']=a1['Q11']
a1['Count']=a1['Duration (in seconds)']
# a.drop(['Q2','Q23','Duration (in seconds)'],axis=1,inplace=True)
a2=data_1.groupby(['Q12_3','Q11'],as_index=False)['Duration (in seconds)'].count()
a2['Language']=a2['Q12_3']
a2['Year']=a2['Q11']
a2['Count']=a2['Duration (in seconds)']
a3=data_1.groupby(['Q12_2','Q11'],as_index=False)['Duration (in seconds)'].count()
a3['Language']=a3['Q12_2']
a3['Year']=a3['Q11']
a3['Count']=a3['Duration (in seconds)']
a4=data_1.groupby(['Q12_6','Q11'],as_index=False)['Duration (in seconds)'].count()
a4['Language']=a4['Q12_6']
a4['Year']=a4['Q11']
a4['Count']=a4['Duration (in seconds)']

fig=make_subplots(rows=2,cols=2,subplot_titles=('<em>Python</em>','<em>SQL</em>','<em>R</em>','<em>C++</em>'))
b1=a1[a1['Language']=='Python']
fig.add_trace(go.Bar(x=b1['Year'],y=b1['Count'],name='Python',text=b1['Count']),row=1,col=1)
b2=a2[a2['Language']=='SQL']
fig.add_trace(go.Bar(x=b2['Year'],y=b2['Count'],name='SQL',text=b2['Count']),row=1,col=2)
b3=a3[a3['Language']=='R']
fig.add_trace(go.Bar(x=b3['Year'],y=b3['Count'],name='R',text=b3['Count']),row=2,col=1)
b4=a4[a4['Language']=='C++']
fig.add_trace(go.Bar(x=b4['Year'],y=b4['Count'],name='C++',text=b4['Count']),row=2,col=2)
fig.show()

Let's take a look at how programmers with different amounts of programming experience use these languages. We find that Python is indispensable even among those who have been programming for only a short time.

3-2-5 Integrated development environment (Q13) usage¶

I believe everyone has downloaded plenty of software and configured one environment after another. Let's take a look at which integrated development environment is the most popular.

In [20]:
tool=['JupyterLab','RStudio','Visual Studio','Visual Studio Code (VSCode)','PyCharm','Spyder','Notepad++','Sublime Text','Vim / Emacs','MATLAB','Jupyter Notebook','IntelliJ','None','Other']
count=[4887,3824,4416,9976,6099,2880,3891,2218,1448,2302,13684,1612,409,1474]
tools=pd.DataFrame({'Tool':tool,'Count':count})
tools=tools.sort_values(by='Count',ascending=False).reset_index(drop=True)
fig=px.bar(tools,x='Tool',y='Count',color='Tool',text='Count',template='simple_white',title='<b>IDEs used by Kagglers on a regular basis</b>')
fig.update_layout(title_x=0.5)
fig.update_traces(marker=dict(line=dict(color='#000000', width=1.6)))
fig.show()

The visualization of question Q13 shows that Jupyter Notebook takes the top spot without any suspense. Jupyter's one-cell-one-run workflow saves the trouble of juggling many separate files, and visualization results are displayed inline. Its popularity is well deserved.

In [21]:
a1=data_1.groupby(['Q13_11','Q11'],as_index=False)['Duration (in seconds)'].count()
a1['Ide']=a1['Q13_11']
a1['Year']=a1['Q11']
a1['Count']=a1['Duration (in seconds)']
# a.drop(['Q2','Q23','Duration (in seconds)'],axis=1,inplace=True)
a2=data_1.groupby(['Q13_4','Q11'],as_index=False)['Duration (in seconds)'].count()
a2['ide']=a2['Q13_4']
a2['Year']=a2['Q11']
a2['Count']=a2['Duration (in seconds)']
a3=data_1.groupby(['Q13_5','Q11'],as_index=False)['Duration (in seconds)'].count()
a3['ide']=a3['Q13_5']
a3['Year']=a3['Q11']
a3['Count']=a3['Duration (in seconds)']
a4=data_1.groupby(['Q13_1','Q11'],as_index=False)['Duration (in seconds)'].count()
a4['ide']=a4['Q13_1']
a4['Year']=a4['Q11']
a4['Count']=a4['Duration (in seconds)']

# The survey's answer strings contain stray spaces; they are kept verbatim in the filters below
fig=make_subplots(rows=2,cols=2,subplot_titles=('<em>Jupyter Notebook</em>','<em>Visual Studio Code (VSCode)</em>','<em>PyCharm</em>','<em>JupyterLab</em>'))
b1=a1[a1['Ide']==' Jupyter Notebook']
fig.add_trace(go.Bar(x=b1['Year'],y=b1['Count'],name='Jupyter Notebook',text=b1['Count']),row=1,col=1)
b2=a2[a2['ide']==' Visual Studio Code (VSCode) ']
fig.add_trace(go.Bar(x=b2['Year'],y=b2['Count'],name='Visual Studio Code (VSCode)',text=b2['Count']),row=1,col=2)
b3=a3[a3['ide']==' PyCharm ']
fig.add_trace(go.Bar(x=b3['Year'],y=b3['Count'],name='PyCharm',text=b3['Count']),row=2,col=1)
b4=a4[a4['ide']=='JupyterLab ']
fig.add_trace(go.Bar(x=b4['Year'],y=b4['Count'],name='JupyterLab',text=b4['Count']),row=2,col=2)
fig.show()

We observe that for programmers at every level of experience, Jupyter Notebook is still the top choice: it is friendly to novices and can also meet the needs of experts.

3-2-6 Usage of data visualization libraries (Q15)¶

Data visualization is an indispensable part of data analysis. Through visualization, raw and impersonal data can be understood by anyone, with or without a programming background. This is essential for making reports and discussing projects clearly with managers.

In [16]:
library=['Matplotlib','Seaborn','Plotly / Plotly Express','Ggplot / ggplot2','Shiny','D3 js','Altair','Bokeh','Geoplotlib','Leaflet / Folium','Pygal','Dygraphs','Highcharter','None','Other']
count=[14010,10512,5078,4145,1043,734,300,771,1167,554,318,225,198,3439,691]
df8=pd.DataFrame({'Library':library,'Count':count})  # renamed from `data` to avoid clobbering that variable
df8=df8.sort_values(by='Count',ascending=False).reset_index(drop=True)
fig=px.bar(df8,x='Library',y='Count',template='simple_white',color='Library',text='Count',title='<b>Data visualization libraries used by kagglers on a regular basis</b>')
fig.update_traces(marker=dict(line=dict(color='#000000', width=1.6)))
fig.show()

Through the visualization of question Q15, we found that matplotlib is the most commonly used data visualization library, which includes various common image drawing functions.

In [23]:
a1=data_1.groupby(['Q15_1','Q11'],as_index=False)['Duration (in seconds)'].count()
a1['dv']=a1['Q15_1']
a1['Year']=a1['Q11']
a1['Count']=a1['Duration (in seconds)']
# a.drop(['Q2','Q23','Duration (in seconds)'],axis=1,inplace=True)
a2=data_1.groupby(['Q15_2','Q11'],as_index=False)['Duration (in seconds)'].count()
a2['dv']=a2['Q15_2']
a2['Year']=a2['Q11']
a2['Count']=a2['Duration (in seconds)']
a3=data_1.groupby(['Q15_3','Q11'],as_index=False)['Duration (in seconds)'].count()
a3['dv']=a3['Q15_3']
a3['Year']=a3['Q11']
a3['Count']=a3['Duration (in seconds)']
a4=data_1.groupby(['Q15_4','Q11'],as_index=False)['Duration (in seconds)'].count()
a4['dv']=a4['Q15_4']
a4['Year']=a4['Q11']
a4['Count']=a4['Duration (in seconds)']

# The survey's answer strings contain stray spaces; they are kept verbatim in the filters below
fig=make_subplots(rows=2,cols=2,subplot_titles=('<em>Matplotlib</em>','<em>Seaborn</em>','<em>Plotly / Plotly Express</em>','<em>Ggplot / ggplot2</em>'))
b1=a1[a1['dv']==' Matplotlib ']
fig.add_trace(go.Bar(x=b1['Year'],y=b1['Count'],name='Matplotlib',text=b1['Count']),row=1,col=1)
b2=a2[a2['dv']==' Seaborn ']
fig.add_trace(go.Bar(x=b2['Year'],y=b2['Count'],name='Seaborn',text=b2['Count']),row=1,col=2)
b3=a3[a3['dv']==' Plotly / Plotly Express ']
fig.add_trace(go.Bar(x=b3['Year'],y=b3['Count'],name='Plotly / Plotly Express',text=b3['Count']),row=2,col=1)
b4=a4[a4['dv']==' Ggplot / ggplot2 ']
fig.add_trace(go.Bar(x=b4['Year'],y=b4['Count'],name='Ggplot / ggplot2',text=b4['Count']),row=2,col=2)
fig.show()

As with IDEs, Matplotlib is widely popular among both novices and veterans for data visualization, followed by Seaborn.

3-2-7 Machine Learning Framework (Q17) Usage¶

Machine learning frameworks guide our study of the subject, and in practitioners' hands they are indispensable helpers: by training and building on existing models, they produce results for all kinds of problems. Their influence on students and practitioners, in both depth and breadth, is hard to overstate.

In [24]:
work=['Scikit-learn','TensorFlow','Keras','PyTorch','Fast.ai','Xgboost','LightGBM','CatBoost','Caret','Tidymodels','JAX','PyTorch Lightning','Huggingface','None','Other']
count=[11403,7953,6575,5191,648,4477,1940,1165,821,547,252,1013,1332,1709,620]
d1=pd.DataFrame({'Frameworks':work,'Count':count})
d1=d1.sort_values(by='Count',ascending=False).reset_index(drop=True)
fig=px.bar(d1,x='Frameworks',y='Count',color='Frameworks',text='Count',template='simple_white',title='<b>Machine learning frameworks used by kagglers on a regular basis</b>')
fig.update_traces(marker=dict(line=dict(color='#000000', width=1.6)))
fig.show()

Analyzing the visualization of question Q17, we find that scikit-learn is the most commonly used machine learning framework; it contains a wide variety of models, many of which we also studied in this semester's statistical learning theory course.

In [25]:
a1=data_1.groupby(['Q17_1','Q11'],as_index=False)['Duration (in seconds)'].count()
a1['ml']=a1['Q17_1']
a1['Year']=a1['Q11']
a1['Count']=a1['Duration (in seconds)']
# a.drop(['Q2','Q23','Duration (in seconds)'],axis=1,inplace=True)
a2=data_1.groupby(['Q17_2','Q11'],as_index=False)['Duration (in seconds)'].count()
a2['ml']=a2['Q17_2']
a2['Year']=a2['Q11']
a2['Count']=a2['Duration (in seconds)']
a3=data_1.groupby(['Q17_3','Q11'],as_index=False)['Duration (in seconds)'].count()
a3['ml']=a3['Q17_3']
a3['Year']=a3['Q11']
a3['Count']=a3['Duration (in seconds)']
a4=data_1.groupby(['Q17_4','Q11'],as_index=False)['Duration (in seconds)'].count()
a4['ml']=a4['Q17_4']
a4['Year']=a4['Q11']
a4['Count']=a4['Duration (in seconds)']

# The survey's answer strings contain stray spaces; they are kept verbatim in the filters below
fig=make_subplots(rows=2,cols=2,subplot_titles=('<em>Scikit-learn</em>','<em>TensorFlow</em>','<em>Keras</em>','<em>PyTorch</em>'))
b1=a1[a1['ml']=='  Scikit-learn ']
fig.add_trace(go.Bar(x=b1['Year'],y=b1['Count'],name='Scikit-learn',text=b1['Count']),row=1,col=1)
b2=a2[a2['ml']=='  TensorFlow ']
fig.add_trace(go.Bar(x=b2['Year'],y=b2['Count'],name='TensorFlow',text=b2['Count']),row=1,col=2)
b3=a3[a3['ml']==' Keras ']
fig.add_trace(go.Bar(x=b3['Year'],y=b3['Count'],name='Keras',text=b3['Count']),row=2,col=1)
b4=a4[a4['ml']==' PyTorch ']
fig.add_trace(go.Bar(x=b4['Year'],y=b4['Count'],name='PyTorch',text=b4['Count']),row=2,col=2)
fig.show()

scikit-learn covers a wide range of material and suits all kinds of users: whether you are opening it for the first time or are already proficient, the library can meet your needs, and it applies across many fields and problem types.

3-2-8 Use of machine learning algorithms (Q18)¶

Machine learning algorithms are closely tied to mathematical derivation. In class we went through the derivations of models such as logistic regression, decision trees, and Bayesian classifiers. Only by understanding them at the mathematical level can we choose the appropriate model for training when applying them; this is very important.

In [26]:
work=['Linear or Logistic Regression','Decision Trees or Random Forests','Gradient Boosting Machines (xgboost, lightgbm, etc)','Bayesian Approaches','Evolutionary Approaches','Dense Neural Networks (MLPs, etc)','Convolutional Neural Networks','Generative Adversarial Networks','Recurrent Neural Networks','Transformer Networks (BERT, gpt-3, etc)','Autoencoder Networks (DAE, VAE, etc)','Graph Neural Networks','None','Other']
count=[11338,9373,5506,3661,823,3476,6006,1166,3451,2196,1234,1422,1326,538]
d1=pd.DataFrame({'Algorithms':work,'Count':count})
d1=d1.sort_values(by='Count',ascending=False).reset_index(drop=True)
fig=px.bar(d1,x='Algorithms',y='Count',color='Algorithms',text='Count',template='simple_white',title='<b>ML algorithms used by kagglers on a regular basis</b>')
fig.update_traces(marker=dict(line=dict(color='#000000', width=1.6)))
fig.show()

We find that linear or logistic regression is the most commonly used family of algorithms. Many real-world problems are linear, and some multi-class problems can be decomposed into several logistic regressions: as taught in class, a multi-class problem can be solved as a combination of binary classification problems.
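The "multi-class as a combination of binary classifiers" idea mentioned above can be sketched with scikit-learn's one-vs-rest wrapper; this is a toy example on the iris data, unrelated to the survey:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)  # a 3-class problem

# One binary logistic regression is fit per class; prediction picks
# the class whose binary model is most confident.
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(len(ovr.estimators_))  # one fitted binary classifier per class
print(round(ovr.score(X, y), 3))
```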

In [27]:
a1=data_1.groupby(['Q18_1','Q11'],as_index=False)['Duration (in seconds)'].count()
a1['ml']=a1['Q18_1']
a1['Year']=a1['Q11']
a1['Count']=a1['Duration (in seconds)']
# a.drop(['Q2','Q23','Duration (in seconds)'],axis=1,inplace=True)
a2=data_1.groupby(['Q18_2','Q11'],as_index=False)['Duration (in seconds)'].count()
a2['ml']=a2['Q18_2']
a2['Year']=a2['Q11']
a2['Count']=a2['Duration (in seconds)']
a3=data_1.groupby(['Q18_7','Q11'],as_index=False)['Duration (in seconds)'].count()
a3['ml']=a3['Q18_7']
a3['Year']=a3['Q11']
a3['Count']=a3['Duration (in seconds)']
a4=data_1.groupby(['Q18_3','Q11'],as_index=False)['Duration (in seconds)'].count()
a4['ml']=a4['Q18_3']
a4['Year']=a4['Q11']
a4['Count']=a4['Duration (in seconds)']

fig=make_subplots(rows=2,cols=2,subplot_titles=('<em>Linear or Logistic Regression</em>','<em>Decision Trees or Random Forests</em>','<em>Convolutional Neural Networks</em>','<em>Gradient Boosting Machines</em>'))
b1=a1[a1['ml']=='Linear or Logistic Regression']
fig.add_trace(go.Bar(x=b1['Year'],y=b1['Count'],name='Linear or Logistic Regression',text=b1['Count']),row=1,col=1)
b2=a2[a2['ml']=='Decision Trees or Random Forests']
fig.add_trace(go.Bar(x=b2['Year'],y=b2['Count'],name='Decision Trees or Random Forests',text=b2['Count']),row=1,col=2)
b3=a3[a3['ml']=='Convolutional Neural Networks']
fig.add_trace(go.Bar(x=b3['Year'],y=b3['Count'],name='Convolutional Neural Networks',text=b3['Count']),row=2,col=1)
b4=a4[a4['ml']=='Gradient Boosting Machines (xgboost, lightgbm, etc)']
fig.add_trace(go.Bar(x=b4['Year'],y=b4['Count'],name='Gradient Boosting Machines',text=b4['Count']),row=2,col=2)
fig.show()

Linear and logistic regression are used most by beginners, presumably because linear problems are simpler than nonlinear ones. For programming veterans, I suspect it is because many problems can in fact be solved with linear or logistic regression, or because splitting a difficult problem into linear pieces makes it simpler and easier to handle.

3-2-9 Use of computer vision methods (Q19)¶

The application of computer vision is very extensive. It has many applications in medicine, agriculture, driving and other fields. Among the computer vision methods, let's take a look at the most commonly used methods.

In [28]:
work=['General purpose image/video tools','Image segmentation methods','Object detection methods','Image classification and other general purpose networks','Generative Networks','Vision transformer networks','None','Other']
count=[2293,2495,2525,3664,1343,782,1455,146]
d1=pd.DataFrame({'Vision Methods':work,'Count':count})
d1=d1.sort_values(by='Count',ascending=False).reset_index(drop=True)
fig=px.bar(d1,x='Vision Methods',y='Count',color='Vision Methods',text='Count',template='simple_white',title='<b>Computer vision methods used on a regular basis</b>')
fig.update_traces(marker=dict(line=dict(color='#000000', width=1.6)))
fig.show()

Image classification and other general-purpose networks are the most used computer vision methods. Image classification also has many practical applications in our other artificial intelligence course; once you understand the relationship between artificial intelligence, machine learning, and deep learning, the learning path becomes very clear.

In [17]:
a1=data_1.groupby(['Q19_4','Q11'],as_index=False)['Duration (in seconds)'].count()
a1['ml']=a1['Q19_4']
a1['Year']=a1['Q11']
a1['Count']=a1['Duration (in seconds)']
# a.drop(['Q2','Q23','Duration (in seconds)'],axis=1,inplace=True)
a2=data_1.groupby(['Q19_3','Q11'],as_index=False)['Duration (in seconds)'].count()
a2['ml']=a2['Q19_3']
a2['Year']=a2['Q11']
a2['Count']=a2['Duration (in seconds)']
a3=data_1.groupby(['Q19_2','Q11'],as_index=False)['Duration (in seconds)'].count()
a3['ml']=a3['Q19_2']
a3['Year']=a3['Q11']
a3['Count']=a3['Duration (in seconds)']
a4=data_1.groupby(['Q19_1','Q11'],as_index=False)['Duration (in seconds)'].count()
a4['ml']=a4['Q19_1']
a4['Year']=a4['Q11']
a4['Count']=a4['Duration (in seconds)']

fig=make_subplots(rows=2,cols=2,subplot_titles=('<em>Image classification</em>','<em>Object detection methods</em>','<em>Image segmentation methods</em>','<em>General purpose image/video tools</em>'))
b1=a1[a1['ml']=='Image classification and other general purpose networks (VGG, Inception, ResNet, ResNeXt, NASNet, EfficientNet, etc)']
fig.add_trace(go.Bar(x=b1['Year'],y=b1['Count'],name='Image classification (VGG, Inception, ResNet)',text=b1['Count']),row=1,col=1)
b2=a2[a2['ml']=='Object detection methods (YOLOv6, RetinaNet, etc)']
fig.add_trace(go.Bar(x=b2['Year'],y=b2['Count'],name='Object detection methods (YOLOv6, RetinaNet, etc)',text=b2['Count']),row=1,col=2)
b3=a3[a3['ml']=='Image segmentation methods (U-Net, Mask R-CNN, etc)']
fig.add_trace(go.Bar(x=b3['Year'],y=b3['Count'],name='Image segmentation methods (U-Net, Mask R-CNN)',text=b3['Count']),row=2,col=1)
b4=a4[a4['ml']=='General purpose image/video tools (PIL, cv2, skimage, etc)']
fig.add_trace(go.Bar(x=b4['Year'],y=b4['Count'],name='General purpose image/video tools (PIL, cv2)',text=b4['Count']),row=2,col=2)
fig.show()

3-2-10 Cloud platform (Q32) usage¶

The amount of data is always a problem. Using products from major cloud platforms facilitates data storage and retrieval.

In [107]:
Num_Q32 = data_1["Q32"].value_counts()
print(Num_Q32)


fig=px.pie(values=Num_Q32.values, names=Num_Q32.index, hole=0.4, template='ggplot2')  # was mistakenly Num_Q11

fig.update_traces(textposition='outside', textinfo='percent+label')

fig.update_layout(
    title={
        "text":"Cloud platform usage distribution",
        "y":0.96,  # title y position
        "x":0.45,  # title x position
    }
)

fig.show()
Q32
 Amazon Web Services (AWS)                                 555
 Google Cloud Platform (GCP)                               501
They all had a similarly enjoyable developer experience    443
 Microsoft Azure                                           256
None were satisfactory                                      72
 IBM Cloud / Red Hat                                        34
 Oracle Cloud                                               20
Other                                                       20
 VMware Cloud                                               12
 SAP Cloud                                                   7
 Alibaba Cloud                                               5
 Tencent Cloud                                               3
 Huawei Cloud                                                1
Name: count, dtype: int64

3-2-11 Database (Q35) usage¶

Database usage is closely related to machine learning: while working with a database we can observe the characteristics of the data and perform operations such as filtering rows and selecting features.

In [31]:
work=['MySQL','PostgreSQL ','SQLite ','Oracle Database ','MongoDB ','Snowflake ','IBM Db2 ','Microsoft SQL Server ','Microsoft Azure SQL Database ','Amazon Redshift ','Amazon RDS ','Amazon DynamoDB ','Google Cloud BigQuery ','Google Cloud SQL ','None','Other']
count=[2233,1516,1159,688,1031,399,192,1203,520,380,505,356,690,439,955,217]
d1=pd.DataFrame({'Database':work,'Count':count})
d1=d1.sort_values(by='Count',ascending=False).reset_index(drop=True)
fig=px.bar(d1,x='Database',y='Count',color='Database',text='Count',template='simple_white',title='<b>Databases used by kagglers on a regular basis</b>')
fig.update_traces(marker=dict(line=dict(color='#000000', width=1.6)))
fig.show()
In [18]:
a1=data_1.groupby(['Q35_1','Q11'],as_index=False)['Duration (in seconds)'].count()
a1['ml']=a1['Q35_1']
a1['Year']=a1['Q11']
a1['Count']=a1['Duration (in seconds)']
# a.drop(['Q2','Q23','Duration (in seconds)'],axis=1,inplace=True)
a2=data_1.groupby(['Q35_2','Q11'],as_index=False)['Duration (in seconds)'].count()
a2['ml']=a2['Q35_2']
a2['Year']=a2['Q11']
a2['Count']=a2['Duration (in seconds)']
a3=data_1.groupby(['Q35_8','Q11'],as_index=False)['Duration (in seconds)'].count()
a3['ml']=a3['Q35_8']
a3['Year']=a3['Q11']
a3['Count']=a3['Duration (in seconds)']
a4=data_1.groupby(['Q35_3','Q11'],as_index=False)['Duration (in seconds)'].count()
a4['ml']=a4['Q35_3']
a4['Year']=a4['Q11']
a4['Count']=a4['Duration (in seconds)']

# The survey's answer strings contain trailing spaces; they are kept verbatim in the filters below
fig=make_subplots(rows=2,cols=2,subplot_titles=('<em>MySQL</em>','<em>PostgreSQL</em>','<em>Microsoft SQL Server</em>','<em>SQLite</em>'))
b1=a1[a1['ml']=='MySQL ']
fig.add_trace(go.Bar(x=b1['Year'],y=b1['Count'],name='MySQL',text=b1['Count']),row=1,col=1)
b2=a2[a2['ml']=='PostgreSQL ']
fig.add_trace(go.Bar(x=b2['Year'],y=b2['Count'],name='PostgreSQL',text=b2['Count']),row=1,col=2)
b3=a3[a3['ml']=='Microsoft SQL Server ']
fig.add_trace(go.Bar(x=b3['Year'],y=b3['Count'],name='Microsoft SQL Server',text=b3['Count']),row=2,col=1)
b4=a4[a4['ml']=='SQLite ']
fig.add_trace(go.Bar(x=b4['Year'],y=b4['Count'],name='SQLite',text=b4['Count']),row=2,col=2)
fig.show()

4. Data prediction¶

After all these visual analyses, we have a better picture of practitioners' basic situation. But don't forget that the original motivation for this analysis is a more practical question.

No matter what, supporting a family is the top priority. Everyone says it is easy to make money working in the computer industry, so let's follow the data and dig into the actual state of the industry. After all, this is the last step in becoming a talent.

Knowledge is important, but knowledge that can create gold is more important. So what kind of knowledge can be related to our creation of gold?

What characteristics or elements do we need to have?

Let's take these questions to find out!

4-1 Selection of independent and dependent variables¶

4-1-1 Dependent variable - transformation of salary field¶

Salary is the dependent variable, but we noticed a problem: salary is reported as a series of numeric bands, and the bands are not equally spaced. Treating this as a classification problem runs into three main issues: the bands differ greatly in width, they cannot cover every possible salary, and people have no intuitive sense of which salary "class" they belong to.

When we consider the problem as a regression problem, the above problems can be effectively alleviated.

In [19]:
# Replace each salary band ($0-999, 200,000-249,999, >$1,000,000, ...) with its midpoint
data_1['Q29'] = data_1['Q29'].str.replace('$','',regex=False).str.replace(',','',regex=False).str.replace('>','1000000-',regex=False)

data_1[['Sal_l', 'Sal_h']]=data_1['Q29'].str.split('-', n=1, expand=True)  # n: max number of splits; expand: return a DataFrame
# data_1.dtypes

data_1['Sal_l'] = pd.to_numeric(data_1['Sal_l'])
data_1['Sal_h'] = pd.to_numeric(data_1['Sal_h'])
data_1['Salary'] = round((data_1['Sal_l'] + data_1['Sal_h']) / 2) 
# print(data_1['Salary'])
data_1.drop(["Sal_l","Sal_h","Q29"], axis=1, inplace=True)

Remove outliers

At the beginning we did not remove outliers, which led to poor training results. After consulting relevant material, we learned that outliers distort model training, so these "bad apples" should be removed to keep them from affecting the results later on.

In [20]:
fig = px.box(
      data_1, 
      y="Salary" ,
      points="all" 
)

fig.show()
In [40]:
def outlier(data):
    # Return the index of observations outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
    Q1 = data.quantile(0.25)
    Q3 = data.quantile(0.75)
    IQR = Q3 - Q1
    Obs_min = Q1 - 1.5*IQR
    Obs_max = Q3 + 1.5*IQR
    result = (data < Obs_min) | (data > Obs_max)  # boolean mask
    return data[result].index

# outlier(data_1['Salary'])
data_1.drop(list(outlier(data_1.Salary)), axis=0, inplace=True)#23617 
In [41]:
# data_1
data_1 = data_1.reset_index(drop=True)

4-1-2 Data processing¶

Now that we have dealt with the dependent variable, let's take a look at the characteristics of the independent variable.

Our data set consists of single-choice questions and multiple-choice questions, so correspondingly, multiple-choice and single-choice questions should have different processing methods.

Deduplication

In [43]:
data_1.drop_duplicates(inplace=True)
data_1.shape
Out[43]:
(23617, 296)

About handling missing values

When we thought about this problem, we found it was not easy to solve. For multiple-choice questions, every respondent leaves some options blank, so those missing values cannot simply be dropped. A missing multiple-choice value is therefore treated as 0 ("not selected"), while missing single-choice values are handled by the one-hot encoding below.
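On a toy frame (hypothetical values in the survey's column style) the scheme looks like this: multiple-choice blanks become 0, and one-hot encoding a single-choice column turns a blank into an all-zero row of dummies.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Q6_1': ['Coursera', np.nan, 'Coursera'],    # multiple-choice column
    'Q11': ['1-3 years', np.nan, '5-10 years'],  # single-choice column
})

# Multiple-choice: NaN means "not selected" -> 0, any answer -> 1
toy['Q6_1'] = toy['Q6_1'].notna().astype(int)

# Single-choice: one-hot encode; the NaN respondent gets all-zero dummies
toy = pd.get_dummies(toy, columns=['Q11'])
print(toy)
```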

Encoding

In [42]:
def column_name(name):
    return [col for col in data_1.columns if name in col]

data = column_name('_')
data_cod = data_1[data]
In [44]:
data_c = ['Q2','Q3','Q4','Q5','Q8','Q9','Q11','Q16','Q22','Q23','Q24','Q25','Q26','Q27','Q30','Q32','Q43']

data_1 = pd.get_dummies(data=data_1, columns=data_c)
data_1.head()
Out[44]:
Duration (in seconds) Q6_1 Q6_2 Q6_3 Q6_4 Q6_5 Q6_6 Q6_7 Q6_8 Q6_9 ... Q32_ Tencent Cloud Q32_ VMware Cloud Q32_None were satisfactory Q32_Other Q32_They all had a similarly enjoyable developer experience Q43_2-5 times Q43_6-25 times Q43_More than 25 times Q43_Never Q43_Once
0 121 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 0 0 0 0 0 0 0 0 0 0
1 462 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 0 0 0 0 0 0 0 0 0 0
2 293 Coursera edX NaN DataCamp NaN Udacity Udemy LinkedIn Learning NaN ... 0 0 0 0 0 0 0 0 0 0
3 851 Coursera NaN Kaggle Learn Courses NaN NaN NaN Udemy NaN NaN ... 0 0 0 0 0 1 0 0 0 0
4 232 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 0 0 0 0 0 0 0 0 0 0

5 rows × 461 columns


We use the code below to turn the multiple-choice columns into 0/1 indicators.

In [45]:
#multiple
listem =[]
for i in data_cod.columns:
    listem.append(data_1[i].dropna().unique()[0]) 
data_1[data] = data_1[data].replace(np.nan,0).replace(listem,1) 
data_1.head()
Out[45]:
Duration (in seconds) Q6_1 Q6_2 Q6_3 Q6_4 Q6_5 Q6_6 Q6_7 Q6_8 Q6_9 ... Q32_ Tencent Cloud Q32_ VMware Cloud Q32_None were satisfactory Q32_Other Q32_They all had a similarly enjoyable developer experience Q43_2-5 times Q43_6-25 times Q43_More than 25 times Q43_Never Q43_Once
0 121 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 462 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 293 1 1 0 1 0 1 1 1 0 ... 0 0 0 0 0 0 0 0 0 0
3 851 1 0 1 0 0 0 1 0 0 ... 0 0 0 0 0 1 0 0 0 0
4 232 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 461 columns


4-1-3 Independent variable selection¶

In [61]:
data_1 = data_1.dropna(subset=['Salary'], how='any')  # how: 'any' or 'all', default 'any'
X = data_1.drop('Salary', axis=1)
y = data_1['Salary']
print(y.shape, X.shape)
# X.to_excel("bigwork.xlsx")
(7756,) (7756, 460)

After processing the data, I found that there are many redundant variables piled up here. So which variables should we use for prediction?

During the learning process we found the SelectKBest function, which solves this problem nicely.


SelectKBest has two parameters: score_func, a function that scores each feature so that features can be ranked from high to low, and k, which limits how many features are kept (10 by default).

In [47]:
#SelectKBest
from sklearn.feature_selection import SelectKBest, f_regression
fs = SelectKBest(score_func=f_regression, k='all')  # alternatively score_func=mutual_info_regression

fit = fs.fit(X, y)

feature_imp = pd.DataFrame(fs.scores_, columns=['Score'], index=X.columns)  # scores_: the score of each feature
top20_feature = feature_imp.nlargest(n=20, columns=['Score'])

plt.figure(figsize=(8,5))
g = sns.barplot(y=top20_feature.index, x=top20_feature['Score'])
p = plt.title('Top 20 features by F-test score')
p = plt.xlabel('Score')
p = plt.ylabel('Feature name')
p = g.set_xticklabels(g.get_xticklabels(), rotation=45, horizontalalignment='right')

After selecting the first 20 data, we want to see what the relationship is between these 20 data. Visualization is the most intuitive and fastest way.

In [48]:
#Dispersed Color
cmap = sns.diverging_palette(h_neg=100, h_pos=200, s=80, l=55, n=9)
plt.figure(figsize=(15, 15))
corr = X[top20_feature.index].corr()

mask = np.triu(np.ones_like(corr, dtype=bool))  # np.bool is removed in modern NumPy
# ones_like: array of ones with the same shape as corr; triu: mask for the upper triangle
g = sns.heatmap(corr, annot=True,  vmax=0.3, cmap=cmap, mask=mask, square=True, linewidths=0.05)
p = plt.title('Correlation matrix')

In the subsequent prediction process, we found that too many variables are also a headache, because this will produce more influencing factors that reduce the prediction effect.

So when selecting features, we changed the k variables and found the optimal number of variables from the top5 variables, top10 variables, and top20 variables.

In [49]:
from sklearn.feature_selection import SelectKBest, f_regression
fs = SelectKBest(score_func=f_regression, k='all')
fit = fs.fit(X, y)

feature_imp = pd.DataFrame(fs.scores_, columns=['Score'], index=X.columns)  # scores_: the score of each feature
top5_feature = feature_imp.nlargest(n=5, columns=['Score'])

plt.figure(figsize=(8,5))
g = sns.barplot(y=top5_feature.index, x=top5_feature['Score'])
p = plt.title('Top 5 features by F-test score')
p = plt.xlabel('Score')
p = plt.ylabel('Feature name')
p = g.set_xticklabels(g.get_xticklabels(), rotation=45, horizontalalignment='right')
In [50]:
cmap = sns.diverging_palette(h_neg=100, h_pos=200, s=80, l=55, n=9)
plt.figure(figsize=(15, 15))
corr = X[top5_feature.index].corr()

mask = np.triu(np.ones_like(corr, dtype=bool))
g = sns.heatmap(corr, annot=True,  vmax=0.3, cmap=cmap, mask=mask, square=True, linewidths=0.05)
p = plt.title('Correlation matrix')
In [51]:
from sklearn.feature_selection import SelectKBest, f_regression
fs = SelectKBest(score_func=f_regression, k='all')
fit = fs.fit(X, y)  # fit on feature matrix X and target y

feature_imp = pd.DataFrame(fs.scores_, columns=['Score'], index=X.columns)  # scores_: the score of each feature
top10_feature = feature_imp.nlargest(n=10, columns=['Score'])

plt.figure(figsize=(8,5))
g = sns.barplot(y=top10_feature.index, x=top10_feature['Score'])
p = plt.title('Top 10 features by F-test score')
p = plt.xlabel('Score')
p = plt.ylabel('Feature name')
p = g.set_xticklabels(g.get_xticklabels(), rotation=45, horizontalalignment='right')
In [52]:
cmap = sns.diverging_palette(h_neg=100, h_pos=200, s=80, l=55, n=9)
plt.figure(figsize=(15, 15))
corr = X[top10_feature.index].corr()

mask = np.triu(np.ones_like(corr, dtype=bool))
g = sns.heatmap(corr, annot=True,  vmax=0.3, cmap=cmap, mask=mask, square=True, linewidths=0.05)
p = plt.title('Correlation matrix')

As can be seen from the figure above, some independent variables are highly correlated with each other, so we removed two of them, as shown in the following code.

In [53]:
# X = X[top20_feature.index]
# list(X.columns)

# X = X[top5_feature.index]
# list(X.columns)

X = X[top10_feature.index]
X.drop(['Q33_1','Q34_3'], axis=1, inplace=True)
list(X.columns)
Out[53]:
['Q4_United States of America',
 'Q4_India',
 'Q28_3',
 'Q27_We have well established ML methods (i.e., models in production for more than 2 years)',
 'Q16_5-10 years',
 'Q11_20+ years',
 'Q31_1',
 'Q28_5']

4-2 Dataset Partitioning¶

In [132]:
# Divide the dataset
from sklearn.preprocessing import MinMaxScaler#min-max
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
# scaler = MinMaxScaler() 
# scaler = scaler.fit(X) 
# X = scaler.transform(X) 

# scaler = StandardScaler() 
# X = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15) 

Note: min-max scaling combined with test_size=0.15 was also tried; the scaler code above is left commented out in this run.
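One caveat if the commented-out scalers are re-enabled: they are fit on all of X before the split, which leaks test-set statistics into the scaling. A leak-free sketch on toy data (`X_demo`/`y_demo` are hypothetical stand-ins):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Toy stand-in for the feature matrix and salary target
X_demo = np.arange(20, dtype=float).reshape(10, 2)
y_demo = np.arange(10, dtype=float)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.15, random_state=42)

scaler = MinMaxScaler().fit(X_tr)   # fit on the training split only
X_tr_s = scaler.transform(X_tr)     # training columns now span exactly [0, 1]
X_te_s = scaler.transform(X_te)     # test data reuses the training min/max
```

Fitting on the training split only mirrors what happens at prediction time, when new data arrives after the scaler is fixed.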

4-3 Model Training¶

In [133]:
#Random Forest
from sklearn.ensemble import RandomForestRegressor
clf = RandomForestRegressor()
rf = clf.fit (X_train, y_train.ravel())
y_pred = rf.predict(X_test)
print("RandomForestRegressor results:")
print("Training set score:", rf.score(X_train, y_train))
print("Validation set score:", rf.score(X_test, y_test))
RandomForestRegressor results:
Training set score: 0.9401072528160039
Validation set score: 0.5972944686721949
In [134]:
#Linear Regression
from sklearn.linear_model import LinearRegression
clf = LinearRegression()
rf = clf.fit (X_train, y_train.ravel())
y_pred = rf.predict(X_test)
print("LinearRegression results:")
print("Training set score:", rf.score(X_train, y_train))
print("Validation set score:", rf.score(X_test, y_test))
LinearRegression results:
Training set score: 0.6464430857938157
Validation set score: -3.861692249061065e+25
In [135]:
#Ridge Regression
from sklearn.linear_model import Ridge
# x_train,x_test,y_train,y_test = train_test_split(a,b,test_size=0.2)
clf = Ridge()
rf = clf.fit (X_train, y_train.ravel())
y_pred = rf.predict(X_test)
print("Ridge results:")
print("Training set score:", rf.score(X_train, y_train))
print("Validation set score:", rf.score(X_test, y_test))
Ridge results:
Training set score: 0.6465647424733079
Validation set score: 0.6090348828938872
In [136]:
#Lasso Regression
from sklearn.linear_model import Lasso

clf = Lasso()
rf = clf.fit (X_train, y_train.ravel())
y_pred = rf.predict(X_test)
print("Lasso results:")
print("Training set score:", rf.score(X_train, y_train))
print("Validation set score:", rf.score(X_test, y_test))
Lasso results:
Training set score: 0.646564058829653
Validation set score: 0.6091747030930386
In [137]:
#Decision Tree
from sklearn.tree import DecisionTreeRegressor

clf = DecisionTreeRegressor()
rf = clf.fit (X_train, y_train.ravel())
y_pred = rf.predict(X_test)
print("DecisionTreeRegressor results:")
print("Training set score:", rf.score(X_train, y_train))
print("Validation set score:", rf.score(X_test, y_test))
DecisionTreeRegressor results:
Training set score: 1.0
Validation set score: 0.18094718841028012
In [138]:
#Bagging Regression Model
from sklearn.ensemble import BaggingRegressor

clf = BaggingRegressor()
rf = clf.fit (X_train, y_train.ravel())
y_pred = rf.predict(X_test)
print("BaggingRegressor results:")
print("Training set score:", rf.score(X_train, y_train))
print("Validation set score:", rf.score(X_test, y_test))
BaggingRegressor results:
Training set score: 0.9170855188835699
Validation set score: 0.5450183800537018

5 Results Analysis¶

From the training results above, we can see that every model performs poorly. Possible reasons:

  1. The data volume is small: after dropping rows with missing salary values, only about 7,000 records remain.

  2. Each salary bracket was replaced by its mean. Since the brackets are wide, this substitution itself introduces a large error and limits the achievable accuracy.

  3. Some models (notably the decision tree) score very high on the training set but very low on the validation set, which suggests overfitting.
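For point 3, one standard remedy is to estimate generalisation with cross-validation and to restrict tree complexity, e.g. via `GridSearchCV` (already imported in section 4-2). A hedged sketch on toy data with an illustrative parameter grid:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Toy stand-in data: the target depends mainly on one feature
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 2 * X_demo[:, 0] + rng.normal(scale=0.3, size=200)

# Capping depth and leaf size limits how closely each tree can memorise the
# training data; 5-fold CV scores every setting on held-out folds.
param_grid = {'max_depth': [3, 5, None], 'min_samples_leaf': [1, 5]}
search = GridSearchCV(RandomForestRegressor(n_estimators=50, random_state=0),
                      param_grid, cv=5)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

Because every candidate is scored on folds it never saw during fitting, the reported score is a far less optimistic estimate than the training-set R² printed above.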
